Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
(The text below is copied from the HAR website.)
Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Read more: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har#ixzz5GisaTutu
f1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
f2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
pml_training.csv <- read.csv(file = f1, header = TRUE )
pml_testing.csv <- read.csv(file = f2, header = TRUE )
rm(f1, f2)
## remove columns with subject variables (the first 7)
training <- pml_training.csv[ , -c(1:7) ]
testing <- pml_testing.csv[ , -c(1:7) ]
## remove columns with NAs
NA_training <- which(colSums(is.na(training)) == 0)
NA_testing <- which(colSums(is.na(testing)) == 0)
# There are more columns with NAs in the test set, hence we take this list
# to clean both train and test set
training <- training[, NA_testing]
testing <- testing[, NA_testing]
For people unfamiliar with this visualisation: you can click on a cell in the heatmap and then the scatterplot will reveal the variables that reference the cell clicked.
# names(grep("factor", sapply(testing, class), value=TRUE))
# names(grep("factor", sapply(training, class), value=TRUE))
### cor heatmap + scatterplot of selected
iplotCorr(training %>% select(-classe), reorder = TRUE)
## Set screen size to height=700 x width=1000
fit_tree <- train(classe ~ . , method = "rpart", data = training)
rpart.plot(fit_tree$finalModel)
confusionMatrix(fit_tree)
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 25.4 9.0 8.4 7.7 3.3
## B 0.7 6.0 0.7 2.7 2.4
## C 1.9 3.2 7.4 3.9 3.3
## D 0.3 0.8 0.6 1.7 0.6
## E 0.2 0.4 0.3 0.3 8.8
##
## Accuracy (average) : 0.4941
rpart.plot(fit_tree$finalModel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(parallel)
detectCores(); detectCores(logical = FALSE)
## [1] 4
## [1] 2
cluster <- makeCluster(detectCores()-1) # convention to leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit_rf <- train(classe ~ . , method = "rf", data = training, trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
confusionMatrix(fit_rf$finalModel$predicted, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5574 17 0 0 0
## B 4 3774 8 0 1
## C 1 5 3403 23 6
## D 0 1 11 3190 6
## E 1 0 0 3 3594
##
## Overall Statistics
##
## Accuracy : 0.9956
## 95% CI : (0.9945, 0.9964)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9944
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9989 0.9939 0.9944 0.9919 0.9964
## Specificity 0.9988 0.9992 0.9978 0.9989 0.9998
## Pos Pred Value 0.9970 0.9966 0.9898 0.9944 0.9989
## Neg Pred Value 0.9996 0.9985 0.9988 0.9984 0.9992
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1923 0.1734 0.1626 0.1832
## Detection Prevalence 0.2849 0.1930 0.1752 0.1635 0.1834
## Balanced Accuracy 0.9989 0.9966 0.9961 0.9954 0.9981
The out of sample error for the Random Forest Model is: 0.4 percent
# print(fit_rf$finalModel)
pred_rf <- predict(fit_rf, newdata = testing)
print(pred_rf)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Several models were tested, two were selected for reporting as they provide insight. The tree model has limited accuracy, while the random forest model is very good in predicting.